
Conversation

@suket22
Contributor

@suket22 suket22 commented Jan 9, 2026

Summary

Going through the platform connectors, I noticed that we drop health events if a write to MongoDB fails. I think there's value in retrying over a few seconds before we bail; that way we're resilient to a primary failover in the underlying data layer or a brief networking interruption. Fwiw, I don't think ring_buffer.go is an actual ring buffer implementation: it seems to be a thin wrapper around workqueues and is effectively unbounded, so you could still run into memory issues. That's worth addressing at a later point.

  • I've added tests that verify we retry the write to MongoDB on failure, and drop items once retries are exhausted so there's no unbounded memory growth.
  • I made the retry timing configurable so unit tests aren't slowed down when simulating multiple failures.

I think there's value in also making the same changes for the K8s connector, since it has the same issue IMO, but I'll wait for feedback on this PR before doing that too.

In the long term, I think we should do the following:

  • The platform connectors should actually implement a ring buffer so they're not overwhelmed with a large number of findings reported by an underlying health monitor.
  • We could persist the health events to a WAL in conjunction with the in-memory ring buffer. That way we don't lose in-memory data on a pod restart etc.

I've added some unit tests but wasn't able to test this against an actual cluster yet. I'm doing that concurrently and figured I could get feedback in the interim.
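For illustration, here's a minimal sketch of the retry-then-drop idea (hypothetical names; the actual change routes retries through the workqueue's rate limiter rather than an inline loop):

package retrysketch

import (
	"context"
	"fmt"
	"time"
)

// writeWithRetry retries insertEvents with exponential backoff before giving
// up, so a brief primary failover or network blip doesn't cost us the event.
func writeWithRetry(ctx context.Context, insertEvents func(context.Context) error, maxRetries int) error {
	backoff := 500 * time.Millisecond
	for attempt := 0; ; attempt++ {
		err := insertEvents(ctx)
		if err == nil {
			return nil
		}
		if attempt >= maxRetries {
			// Retries exhausted: drop rather than grow memory unboundedly.
			return fmt.Errorf("dropping event after %d attempts: %w", attempt+1, err)
		}
		select {
		case <-time.After(backoff):
			backoff *= 2 // wait a little longer before each retry
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}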

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📚 Documentation
  • 🔧 Refactoring
  • 🔨 Build/CI

Component(s) Affected

  • Core Services
  • Documentation/CI
  • Fault Management
  • Health Monitors
  • Janitor
  • Other: Platform Connectors

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

Summary by CodeRabbit

  • New Features

    • Configurable retry/backoff for store writes, exposed via configuration and Helm values.
    • Ring buffer supports configurable retry timing and a rate-limited enqueue path.
  • Bug Fixes

    • Improved resilience: failed health event writes are retried with backoff up to the configured limit, then dropped to prevent infinite requeues; logs now include retry limits and outcomes.
  • Tests

    • Added tests covering retry, eventual success after retries, and drop-after-max-retries scenarios.


@copy-pr-bot

copy-pr-bot bot commented Jan 9, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Jan 9, 2026

📝 Walkthrough

Walkthrough

Threaded configurable maxRetries into the DatabaseStoreConnector and initialization, added retry/drop decision based on ringbuffer.NumRequeues with rate-limited requeue via AddRateLimited, extended ringbuffer with configurable retry/backoff and requeue inspection, and added unit tests and Helm configuration for the new setting.

Changes

• Store connector (retry flow): platform-connectors/pkg/connectors/store/store_connector.go, platform-connectors/main.go
  Added a maxRetries int field and constructor parameter; InitializeDatabaseStoreConnector now accepts and propagates maxRetries (via config in main). On insert error, the connector reads retryCount from ringbuffer.NumRequeues and either calls AddRateLimited when retryCount < maxRetries, or calls HealthMetricEleProcessingCompleted (Forget + Done) to drop the item when retryCount >= maxRetries; logs include maxRetries.

• Ring buffer (retry/backoff API): platform-connectors/pkg/ringbuffer/ring_buffer.go
  Added configurable retry options (DefaultBaseDelay, DefaultMaxDelay, Option, WithRetryConfig), sketched below; NewRingBuffer accepts options; added AddRateLimited(data) and NumRequeues(data); HealthMetricEleProcessingCompleted now calls Forget before Done.

• Tests (retry/drop scenarios): platform-connectors/pkg/connectors/store/store_connector_test.go
  Updated calls to InitializeDatabaseStoreConnector to pass maxRetries; added TestMessageRetriedOnMongoDBFailure (fails twice, then succeeds) and TestMessageDroppedAfterMaxRetries (always fails), validating retry/backoff and drop behavior and the expected InsertMany call counts.

• Kubernetes Helm config / values: distros/kubernetes/nvsentinel/templates/configmap.yaml, distros/kubernetes/nvsentinel/values.yaml
  Added StoreConnectorMaxRetries to the configmap template and platformConnector.mongodbStore.maxRetries: 3 to values.yaml to expose and configure maxRetries.
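
Inferred from the ring buffer entry above, the options API has roughly this shape (a sketch only; the RingBuffer struct and exact signatures are assumptions, not quotes from the repo):

package ringbuffersketch

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// Defaults matching the summary above; the repo's values live in ring_buffer.go.
const (
	DefaultBaseDelay = 500 * time.Millisecond
	DefaultMaxDelay  = 3 * time.Second
)

type retryConfig struct {
	baseDelay, maxDelay time.Duration
}

// Option customizes the ring buffer's retry/backoff behavior.
type Option func(*retryConfig)

// WithRetryConfig overrides the default delays (useful for fast unit tests).
func WithRetryConfig(base, max time.Duration) Option {
	return func(c *retryConfig) {
		c.baseDelay, c.maxDelay = base, max
	}
}

// RingBuffer here is reduced to the one field the sketch needs.
type RingBuffer struct {
	healthMetricQueue workqueue.RateLimitingInterface
}

func NewRingBuffer(opts ...Option) *RingBuffer {
	cfg := retryConfig{baseDelay: DefaultBaseDelay, maxDelay: DefaultMaxDelay}
	for _, opt := range opts {
		opt(&cfg)
	}
	// Per-item exponential backoff between the configured base and max delays.
	limiter := workqueue.NewItemExponentialFailureRateLimiter(cfg.baseDelay, cfg.maxDelay)
	return &RingBuffer{healthMetricQueue: workqueue.NewRateLimitingQueue(limiter)}
}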

Sequence Diagram

sequenceDiagram
    participant SC as Store Connector
    participant MongoDB as MongoDB
    participant RB as Ring Buffer
    participant RL as Rate Limiter

    SC->>MongoDB: InsertMany(events)
    alt Insert Success
        MongoDB-->>SC: Success
        SC->>RB: HealthMetricEleProcessingCompleted(events)
    else Insert Failure
        MongoDB-->>SC: Error
        SC->>RB: NumRequeues(events) -> retryCount
        alt retryCount < maxRetries
            SC->>RB: AddRateLimited(events)
            RB->>RL: Enqueue with backoff
        else retryCount >= maxRetries
            SC->>RB: HealthMetricEleProcessingCompleted(events) (Forget + Done -> drop)
        end
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I nibble on retries, count each little hop,
I push the events back until the backoff stops.
If failures fade, I cheer — the queue runs thin,
If hops exceed the cap, I tuck the message in.
A soft thump of code: retries kept neat, then on we hop.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 20.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: the PR title clearly and specifically describes the main change, adding retry functionality for MongoDB write operations. It is concise, directly related to the primary objective, and reflects the core fix implemented throughout the changeset.



Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies"



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
platform-connectors/pkg/connectors/store/store_connector_test.go (1)

21-21: Unused sync import.

The sync.Mutex at line 318 is declared, and mu.Lock()/mu.Unlock() are called in the Eventually callback, but callCount is only used within that same callback and the mutex doesn't protect any shared state between the test goroutine and the FetchAndProcessHealthMetric goroutine. The mock client's Calls slice is accessed, which is already thread-safe via testify's internal synchronization.

Consider removing the unused mutex and the sync import to simplify the test.

♻️ Suggested simplification
 import (
 	"context"
 	"errors"
 	"os"
-	"sync"
 	"testing"
 	"time"

And in the test function (lines 318-319, 351-353):

-	var mu sync.Mutex
-	callCount := 0
...
 	require.Eventually(t, func() bool {
-		mu.Lock()
-		defer mu.Unlock()
-		callCount = len(mockClient.Calls)
-		return callCount >= 3
+		return len(mockClient.Calls) >= 3
 	}, 500*time.Millisecond, 10*time.Millisecond, "Should retry and eventually succeed")
platform-connectors/pkg/ringbuffer/ring_buffer.go (1)

118-121: Correct workqueue retry pattern.

The sequence of AddRateLimited followed by Done is the proper pattern for workqueue retries:

  1. AddRateLimited schedules the item for re-processing after the backoff delay
  2. Done marks the current processing attempt as complete, allowing the queue to process other items

Consider adding a brief comment explaining this pattern for future maintainers.

📝 Optional: Add clarifying comment
 func (rb *RingBuffer) AddRateLimited(data *protos.HealthEvents) {
+	// Re-queue with exponential backoff delay, then mark current attempt done
 	rb.healthMetricQueue.AddRateLimited(data)
 	rb.healthMetricQueue.Done(data)
 }
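
For context, the broader consumer loop this pattern belongs to typically looks like the sketch below (standard client-go workqueue API; handle and maxRetries are hypothetical stand-ins for the connector's logic):

package workersketch

import "k8s.io/client-go/util/workqueue"

// processNext drains one item and applies the retry-or-drop decision.
func processNext(q workqueue.RateLimitingInterface, maxRetries int, handle func(item interface{}) error) bool {
	item, shutdown := q.Get()
	if shutdown {
		return false
	}
	if err := handle(item); err != nil && q.NumRequeues(item) < maxRetries {
		q.AddRateLimited(item) // schedule a retry after the backoff delay
	} else {
		q.Forget(item) // success or retries exhausted: clear rate-limiter state
	}
	q.Done(item) // always release the in-flight mark taken by Get()
	return true
}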
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4038245 and ab004a6.

📒 Files selected for processing (3)
  • platform-connectors/pkg/connectors/store/store_connector.go
  • platform-connectors/pkg/connectors/store/store_connector_test.go
  • platform-connectors/pkg/ringbuffer/ring_buffer.go
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • platform-connectors/pkg/connectors/store/store_connector_test.go
  • platform-connectors/pkg/connectors/store/store_connector.go
  • platform-connectors/pkg/ringbuffer/ring_buffer.go
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • platform-connectors/pkg/connectors/store/store_connector_test.go
🧠 Learnings (4)
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code

Applied to files:

  • platform-connectors/pkg/connectors/store/store_connector_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use meaningful variable names such as `synced` over `ok` for cache sync checks

Applied to files:

  • platform-connectors/pkg/connectors/store/store_connector_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules

Applied to files:

  • platform-connectors/pkg/connectors/store/store_connector_test.go
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • platform-connectors/pkg/connectors/store/store_connector_test.go
🧬 Code graph analysis (1)
platform-connectors/pkg/connectors/store/store_connector_test.go (3)
platform-connectors/pkg/ringbuffer/ring_buffer.go (2)
  • NewRingBuffer (55-82)
  • WithRetryConfig (48-53)
store-client/pkg/client/interfaces.go (1)
  • InsertManyResult (95-97)
platform-connectors/pkg/connectors/store/store_connector.go (1)
  • DatabaseStoreConnector (42-48)
🔇 Additional comments (9)
platform-connectors/pkg/connectors/store/store_connector.go (2)

35-40: LGTM! Well-documented constant for retry cap.

The constant is appropriately documented, explaining both the purpose (preventing unbounded memory growth) and the resilience benefit. The value of 3 is reasonable for transient failures.


114-136: Retry logic correctly implements bounded retries with exponential backoff.

The flow is sound:

  • Initial attempt fails → NumRequeues returns 0 → retry
  • After 3 requeues, NumRequeues returns 3 → drop

Good inclusion of contextual information in the error log (node name, check name, event count) to aid debugging dropped events.

platform-connectors/pkg/connectors/store/store_connector_test.go (2)

307-362: Good test coverage for retry success scenario.

The test correctly:

  1. Configures short delays for fast test execution
  2. Sets up mock to fail twice then succeed
  3. Verifies 3 total InsertMany calls
  4. Confirms queue is emptied after success

364-414: Good test coverage for max retry drop scenario.

The test correctly verifies that:

  1. After 4 attempts (initial + 3 retries), the event is dropped
  2. The queue is emptied to prevent unbounded memory growth
  3. The expected number of InsertMany calls matches MaxRetries + 1

One minor observation: consider adding an assertion or comment to make the relationship between MaxRetries (3) and expected calls (4) more explicit for maintainability.

platform-connectors/pkg/ringbuffer/ring_buffer.go (5)

28-33: Good default retry configuration with clear documentation.

The timing values are well-chosen:

  • 500ms base delay provides quick recovery for brief transient failures
  • 3s max delay caps the backoff to avoid excessive waits
  • The comment explaining the progression (500ms → 1.5s → 3s) aids understanding
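
Assuming the limiter is client-go's standard ItemExponentialFailureRateLimiter (an inference from the description; the repo's own comment phrases the progression slightly differently), the per-attempt delays can be previewed directly:

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// The limiter doubles the per-item delay on each failure, capped at max.
	limiter := workqueue.NewItemExponentialFailureRateLimiter(500*time.Millisecond, 3*time.Second)
	item := "health-event"
	for i := 0; i < 4; i++ {
		fmt.Println(limiter.When(item)) // 500ms, 1s, 2s, then 3s (4s capped)
	}
}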

41-53: Clean functional options pattern implementation.

This enables configuring shorter delays in tests while maintaining production defaults, aligning well with the PR objective of making retry logic "configurable to avoid slowing unit tests."


55-82: LGTM! Constructor properly integrates configurable retry settings.

The implementation correctly:

  1. Initializes config with production defaults
  2. Applies any provided options to override
  3. Uses the configured delays for the exponential backoff rate limiter

108-111: Correct: Forget clears rate limiter state before marking done.

Calling Forget before Done ensures the rate limiter's failure count for this item is cleared, preventing stale retry metadata from accumulating.


123-125: LGTM! Correctly exposes requeue count for retry logic.

@lalitadithya
Collaborator

/ok to test ab004a6

@lalitadithya
Collaborator

I think there's value in also making the same changes for the K8s connector, since it has the same issue IMO, but I'll wait for feedback on this PR before doing that too.

I believe for the k8s one we might already be using retry on error -- https://github.com/NVIDIA/NVSentinel/blob/main/platform-connectors/pkg/connectors/kubernetes/process_node_events.go#L88 -- which I believe should handle the retries for us.

The platform connectors should actually implement a ring buffer so they're not overwhelmed with a large number of findings reported by an underlying health monitor.

Yeah, it is a misnomer. When we initially started off we said we'd do a ring buffer, but we never ended up doing so, since we didn't want to drop or overwrite events. If we do make a ring buffer, we will likely also have to maintain a WAL like you mentioned, so that we can keep the memory growth bounded and flush to disk at some cadence so as not to lose events.
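
A rough and entirely hypothetical sketch of that cadence-based flush idea (nothing like this exists in the repo yet; names and structure are made up):

package walsketch

import (
	"context"
	"os"
	"sync"
	"time"
)

// walBuffer buffers events in memory and appends them to disk on a fixed
// cadence, so a pod restart loses at most one flush interval of events.
type walBuffer struct {
	mu      sync.Mutex
	pending [][]byte
	file    *os.File
}

func (w *walBuffer) Append(event []byte) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.pending = append(w.pending, event)
}

func (w *walBuffer) runFlusher(ctx context.Context, cadence time.Duration) {
	ticker := time.NewTicker(cadence)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			w.mu.Lock()
			for _, e := range w.pending {
				w.file.Write(append(e, '\n')) // error handling elided for brevity
			}
			w.file.Sync() // fsync bounds data loss to one cadence
			w.pending = w.pending[:0]
			w.mu.Unlock()
		case <-ctx.Done():
			return
		}
	}
}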

@github-actions

github-actions bot commented Jan 9, 2026

🛡️ CodeQL Analysis

🚨 Found 1 security alert(s)

🔗 View details

@suket22 suket22 force-pushed the sukets/retryMongoWrites branch 2 times, most recently from 36a2b21 to 582424c on January 9, 2026 at 23:42
@suket22
Contributor Author

suket22 commented Jan 9, 2026

/ok to test 582424c

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @distros/kubernetes/nvsentinel/values.yaml:
- Line 115: Add an inline comment for the maxRetries field explaining what it
controls, its default value, acceptable range/units, and its behavior (e.g.,
whether it applies per request or overall retry attempts and how backoff is
handled); update the maxRetries line (the maxRetries key) to include a brief
one-line comment above or to the right that documents purpose, default (3), and
any limits or interaction with backoff/retry logic.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ab004a6 and 582424c.

📒 Files selected for processing (5)
  • distros/kubernetes/nvsentinel/templates/configmap.yaml
  • distros/kubernetes/nvsentinel/values.yaml
  • platform-connectors/main.go
  • platform-connectors/pkg/connectors/store/store_connector.go
  • platform-connectors/pkg/connectors/store/store_connector_test.go
🧰 Additional context used
📓 Path-based instructions (3)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • platform-connectors/main.go
  • platform-connectors/pkg/connectors/store/store_connector.go
  • platform-connectors/pkg/connectors/store/store_connector_test.go
**/values.yaml

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/values.yaml: Document all values in Helm chart values.yaml with inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable

Files:

  • distros/kubernetes/nvsentinel/values.yaml
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • platform-connectors/pkg/connectors/store/store_connector_test.go
🧠 Learnings (1)
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • platform-connectors/pkg/connectors/store/store_connector_test.go
🧬 Code graph analysis (2)
platform-connectors/main.go (2)
platform-connectors/pkg/connectors/store/store_connector.go (2)
  • DatabaseStoreConnector (35-42)
  • InitializeDatabaseStoreConnector (58-80)
platform-connectors/pkg/server/platform_connector_server.go (1)
  • InitializeAndAttachRingBufferForConnectors (70-72)
platform-connectors/pkg/connectors/store/store_connector_test.go (3)
platform-connectors/pkg/connectors/store/store_connector.go (2)
  • InitializeDatabaseStoreConnector (58-80)
  • DatabaseStoreConnector (35-42)
platform-connectors/pkg/ringbuffer/ring_buffer.go (2)
  • NewRingBuffer (55-82)
  • WithRetryConfig (48-53)
store-client/pkg/client/interfaces.go (1)
  • InsertManyResult (95-97)
🔇 Additional comments (9)
distros/kubernetes/nvsentinel/templates/configmap.yaml (1)

28-28: LGTM!

The config entry is correctly added to the Helm template and properly references the values.yaml field.

platform-connectors/main.go (1)

136-159: LGTM!

The config-based initialization properly reads and converts the maxRetries value, following the established patterns in this file for other configuration parameters.

platform-connectors/pkg/connectors/store/store_connector_test.go (3)

298-298: LGTM!

The test correctly updated to include the new maxRetries parameter.


305-357: Excellent test coverage for retry-on-failure scenario.

The test correctly validates that:

  • Failed writes are retried with exponential backoff
  • Queue eventually drains after a successful retry
  • InsertMany is called the expected number of times

The test setup with mock failures followed by success effectively exercises the retry logic.


359-411: Excellent test coverage for drop-after-max-retries scenario.

The test correctly validates the critical behavior that prevents unbounded memory growth:

  • Events are dropped after exceeding maxRetries
  • Queue drains even when all attempts fail
  • InsertMany is called exactly 4 times (1 initial + 3 retries) as expected with maxRetries=3

This test ensures the system won't accumulate events indefinitely during prolonged outages.

platform-connectors/pkg/connectors/store/store_connector.go (4)

41-41: LGTM!

The maxRetries field is appropriately added to the connector struct to support configurable retry behavior.


44-56: LGTM!

The constructor properly accepts and stores the maxRetries parameter.


58-80: LGTM!

The initialization function correctly accepts maxRetries, passes it to the constructor, and includes it in the success log for observability.


111-129: Excellent retry logic implementation with proper safeguards.

The implementation correctly:

  • Retries failed writes with exponential backoff via rate-limited requeue
  • Drops events after exceeding maxRetries to prevent unbounded memory growth
  • Provides rich structured logging for both retry and drop scenarios
  • Includes relevant event details (node name, check name) in drop logs for troubleshooting

The condition retryCount < maxRetries means with maxRetries=3, events are attempted up to 4 times total (1 initial + 3 retries), which aligns with the test expectations and is a reasonable interpretation.
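
To spell the arithmetic out with maxRetries=3:

  attempt 1: NumRequeues = 0, 0 < 3, requeue
  attempt 2: NumRequeues = 1, 1 < 3, requeue
  attempt 3: NumRequeues = 2, 2 < 3, requeue
  attempt 4: NumRequeues = 3, 3 >= 3, drop (Forget + Done)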

@natherz97
Contributor

/ok to test 582424c

@suket22 suket22 force-pushed the sukets/retryMongoWrites branch from 582424c to f75251e on January 12, 2026 at 18:42
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
distros/kubernetes/nvsentinel/values.yaml (1)

117-117: Add inline comment documenting the maxRetries configuration.

Per coding guidelines, all values in Helm chart values.yaml should be documented with inline comments. Consider adding a description explaining the purpose, valid range, and behavior (e.g., what happens when set to 0).

📝 Suggested documentation
   mongodbStore:
     enabled: false
     clientCertMountPath: "/etc/ssl/mongo-client"
-    maxRetries: 3
+    # Maximum number of retry attempts for MongoDB write operations on transient failures.
+    # Set to 0 to disable retries. Total attempts = 1 (initial) + maxRetries.
+    maxRetries: 3

Based on learnings and coding guidelines for **/values.yaml.

platform-connectors/main.go (1)

144-151: Consider validating maxRetries is non-negative.

The code correctly reads and converts StoreConnectorMaxRetries from the config. However, there's no validation that the value is non-negative before passing it downstream. A negative value from misconfiguration could cause unexpected behavior in the retry logic.

🛡️ Optional validation
 	maxRetriesFloat, ok := config["StoreConnectorMaxRetries"].(float64)
 	if !ok {
 		return nil, fmt.Errorf("failed to convert StoreConnectorMaxRetries to float: %v", config["StoreConnectorMaxRetries"])
 	}

 	maxRetries := int(maxRetriesFloat)
+	if maxRetries < 0 {
+		return nil, fmt.Errorf("StoreConnectorMaxRetries must be non-negative, got: %d", maxRetries)
+	}

 	storeConnector, err := store.InitializeDatabaseStoreConnector(ctx, ringBuffer, databaseClientCertMountPath, maxRetries)
platform-connectors/pkg/ringbuffer/ring_buffer.go (1)

113-116: Consider consolidating with HealthMetricEleProcessingCompleted.

HealthMetricEleProcessingFailed now has identical implementation to HealthMetricEleProcessingCompleted. If both are semantically equivalent (item done, no retry), consider consolidating or documenting when to use each.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 582424c and f75251e.

📒 Files selected for processing (6)
  • distros/kubernetes/nvsentinel/templates/configmap.yaml
  • distros/kubernetes/nvsentinel/values.yaml
  • platform-connectors/main.go
  • platform-connectors/pkg/connectors/store/store_connector.go
  • platform-connectors/pkg/connectors/store/store_connector_test.go
  • platform-connectors/pkg/ringbuffer/ring_buffer.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • distros/kubernetes/nvsentinel/templates/configmap.yaml
🧰 Additional context used
📓 Path-based instructions (3)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • platform-connectors/main.go
  • platform-connectors/pkg/ringbuffer/ring_buffer.go
  • platform-connectors/pkg/connectors/store/store_connector.go
  • platform-connectors/pkg/connectors/store/store_connector_test.go
**/values.yaml

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/values.yaml: Document all values in Helm chart values.yaml with inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable

Files:

  • distros/kubernetes/nvsentinel/values.yaml
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • platform-connectors/pkg/connectors/store/store_connector_test.go
🧠 Learnings (2)
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/values.yaml : Document all values in Helm chart `values.yaml` with inline comments

Applied to files:

  • distros/kubernetes/nvsentinel/values.yaml
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • platform-connectors/pkg/connectors/store/store_connector_test.go
🧬 Code graph analysis (2)
platform-connectors/main.go (1)
platform-connectors/pkg/connectors/store/store_connector.go (2)
  • DatabaseStoreConnector (35-42)
  • InitializeDatabaseStoreConnector (58-80)
platform-connectors/pkg/ringbuffer/ring_buffer.go (1)
node-drainer/pkg/config/config.go (1)
  • Duration (37-39)
🔇 Additional comments (13)
platform-connectors/main.go (1)

268-268: LGTM!

The call site is correctly updated to pass the config map, consistent with the pattern used for initializeK8sConnector.

platform-connectors/pkg/connectors/store/store_connector_test.go (3)

298-303: LGTM!

The test correctly passes the new maxRetries parameter to InitializeDatabaseStoreConnector.


306-357: Well-structured retry test.

The test correctly verifies retry behavior with exponential backoff. The WithRetryConfig option with short delays ensures fast test execution. The mock expectations (2 failures, then success) and assertion of 3 total calls correctly validate the retry mechanism.

Minor note: cancel() at line 356 is redundant with defer cancel() at line 310, but harmless.


359-411: Correct max-retries test coverage.

The test properly validates that events are dropped after exceeding maxRetries. The assertion of 4 InsertMany calls (1 initial + 3 retries) correctly reflects the expected behavior when maxRetries=3.

platform-connectors/pkg/connectors/store/store_connector.go (4)

35-42: LGTM!

The maxRetries field addition to DatabaseStoreConnector is clean and well-placed.


44-56: LGTM!

The constructor correctly accepts and assigns the new maxRetries parameter.


58-80: LGTM!

The initialization function correctly threads maxRetries through and logs it for operational visibility.


110-132: Retry logic is well-implemented.

The implementation correctly:

  1. Queries retry count before deciding to requeue or drop
  2. Uses rate-limited requeue for exponential backoff
  3. Logs sufficient context for debugging (retry count, max retries, event count, first event details on drop)
  4. Properly marks items completed when dropping to prevent unbounded memory growth

Note: When maxRetries=0, the condition retryCount < r.maxRetries (i.e., 0 < 0) is false, so retries are effectively disabled. If this is intentional, consider documenting it in the values.yaml comment.

platform-connectors/pkg/ringbuffer/ring_buffer.go (5)

28-33: Good default configuration.

The default delays (500ms base, 3s max) provide reasonable production defaults. The comment documenting the retry timing progression is helpful.


41-53: LGTM!

Clean functional options pattern with good encapsulation (unexported config struct, exported Option type and WithRetryConfig function).


55-82: LGTM!

The constructor correctly applies the functional options pattern with sensible defaults. The exponential failure rate limiter is appropriate for transient failure handling.


108-111: Correct lifecycle management.

Adding Forget() before Done() ensures the rate limiter's failure count is cleared when processing completes successfully. This prevents stale retry counts from affecting future operations.


118-125: LGTM!

The AddRateLimited method correctly sequences operations:

  1. AddRateLimited re-queues the item with exponential backoff delay
  2. Done releases the processing lock from the earlier Get()

NumRequeues properly exposes the workqueue's failure count for retry decisions.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
platform-connectors/main.go (1)

144-151: Consider validating maxRetries is positive.

The config value is read and converted but not validated. A maxRetries of 0 or negative would be accepted, which could cause events to be dropped immediately without retry attempts if misconfigured.

🔧 Suggested validation
 	maxRetries := int(maxRetriesInt64)
+	if maxRetries < 0 {
+		return nil, fmt.Errorf("StoreConnectorMaxRetries must be non-negative, got: %d", maxRetries)
+	}
 
 	storeConnector, err := store.InitializeDatabaseStoreConnector(ctx, ringBuffer, databaseClientCertMountPath, maxRetries)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f75251e and e3ba68f.

📒 Files selected for processing (1)
  • platform-connectors/main.go
🧰 Additional context used
📓 Path-based instructions (1)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • platform-connectors/main.go
🧬 Code graph analysis (1)
platform-connectors/main.go (1)
platform-connectors/pkg/connectors/store/store_connector.go (2)
  • DatabaseStoreConnector (35-42)
  • InitializeDatabaseStoreConnector (58-80)
🔇 Additional comments (1)
platform-connectors/main.go (1)

267-272: LGTM!

The call site correctly passes the config map to propagate the StoreConnectorMaxRetries setting. This follows the same pattern used for initializeK8sConnector.

@lalitadithya lalitadithya merged commit 0729af4 into NVIDIA:main Jan 14, 2026
54 checks passed
